Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction. In contrast to previous work that has focused on individual instances as a search target, we propose the first approach to predict categories and attributes of search targets based on gaze data. However, state-of-the-art models for categorical recognition generally require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism, incorporating both spatial and temporal aspects of human gaze behavior. We show that our approach is effective even when the Gaze Pooling Layer is added to an already trained CNN, thus eliminating the need for expensive joint collection of visual and gaze data. We propose an experimental setup and data set and demonstrate the effectiveness of our method for search target prediction based on gaze behavior. We further study how to integrate temporal and spatial gaze information most effectively, and indicate directions for future research in the gaze-based prediction of mental states.
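The core idea of gaze-as-attention can be illustrated with a minimal sketch: fixations are converted into a spatial density map, which re-weights a CNN feature map before pooling it into a descriptor. All names, shapes, and the Gaussian-density formulation below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaze_pooling(feature_map, fixations, sigma=1.0):
    """Pool a CNN feature map under a gaze-derived attention map.

    feature_map: (H, W, C) array of convolutional activations.
    fixations:   iterable of (row, col) fixation locations, already
                 mapped into feature-map coordinates (an assumption;
                 the original pipeline may handle this differently).
    Returns a (C,) feature descriptor.
    """
    H, W, C = feature_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Build a fixation-density map as a sum of isotropic Gaussians,
    # one per fixation (a common way to smooth sparse gaze points).
    density = np.zeros((H, W))
    for fy, fx in fixations:
        density += np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2)
                          / (2.0 * sigma ** 2))
    density /= density.sum()  # normalize into an attention map
    # Spatially weight the features by gaze attention, then sum-pool.
    return (feature_map * density[:, :, None]).sum(axis=(0, 1))
```

With a single fixation and a small `sigma`, the descriptor is dominated by the feature vector at the fixated location, which is the intended attention effect; temporal aspects (e.g., fixation durations as per-fixation weights) could be folded into the density sum in the same way.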